Goto

Collaborating Authors

 video encoder


Self-supervised Learning of Echocardiographic Video Representations via Online Cluster Distillation

Neural Information Processing Systems

Self-supervised learning (SSL) has achieved major advances in natural images and video understanding, but challenges remain in domains like echocardiography (heart ultrasound) due to subtle anatomical structures, complex temporal dynamics, and the current lack of domain-specific pre-trained models. Existing SSL approaches such as contrastive, masked modeling, and clustering-based methods struggle with high intersample similarity, sensitivity to low PSNR inputs common in ultrasound, or aggressive augmentations that distort clinically relevant features.





Low-FidelityVideoEncoderOptimizationfor TemporalActionLocalization

Neural Information Processing Systems

This is why the two-stage optimization pipeline as described above becomes the most common andfeasible choice inpracticeforoptimizing aTALmodel.


Do Blind Spots Matter for Word-Referent Mapping? A Computational Study with Infant Egocentric Video

arXiv.org Artificial Intelligence

Typically, children start to learn their first words between 6 and 9 months, linking spoken utterances to their visual referents. Without prior knowledge, a word encountered for the first time can be interpreted in countless ways; it might refer to any of the objects in the environment, their components, or attributes. Using longitudinal, egocentric, and ecologically valid data from the experience of one child, in this work, we propose a self-supervised and biologically plausible strategy to learn strong visual representations. Our masked autoencoder-based visual backbone incorporates knowledge about the blind spot in human eyes to define a novel masking strategy. This mask and reconstruct approach attempts to mimic the way the human brain fills the gaps in the eyes' field of view. This represents a significant shift from standard random masking strategies, which are difficult to justify from a biological perspective. The pre-trained encoder is utilized in a contrastive learning-based video-text model capable of acquiring word-referent mappings. Extensive evaluation suggests that the proposed biologically plausible masking strategy is at least as effective as random masking for learning word-referent mappings from cross-situational and temporally extended episodes.



Video CLIP Model for Multi-View Echocardiography Interpretation

arXiv.org Artificial Intelligence

Echocardiography records ultrasound videos of the heart, enabling clinicians to assess cardiac function. Recent advances in large-scale vision-language models (VLMs) have spurred interest in automating echocardiographic interpretation. However, most existing medical VLMs rely on single-frame (image) inputs, which can reduce diagnostic accuracy for conditions identifiable only through cardiac motion. In addition, echocardiographic videos are captured from multiple views, each varying in suitability for detecting specific conditions. Leveraging multiple views may therefore improve diagnostic performance. We developed a video-language model that processes full video sequences from five standard views, trained on 60,747 echocardiographic video-report pairs. We evaluated the gains in retrieval performance from video input and multi-view support, including the contributions of various pretrained models. Code and model weights are available at https://github.com/UTcardiology/video-echo-clip



Long-Form Video-Language Pre-Training with Multimodal Temporal Contrastive Learning Y uchong Sun

Neural Information Processing Systems

Large-scale video-language pre-training has shown significant improvement in video-language understanding tasks. Previous studies of video-language pre-training mainly focus on short-form videos (i.e., within 30 seconds) and sentences,